Appendix A — Regression prediction problem: Common mistakes

Below is a sample solution to the regression prediction problem that contains conceptual mistakes and semantic errors. It highlights some of the common mistakes that students make in their solutions.

Step 0

Assume that missing-value imputation and data cleaning (such as converting price to numeric) have already been done. The cleaned train and test datasets are train_clean and test_clean, respectively.

%reset

# Imputing missing values & cleaning data
%run "missing_value_imputation.ipynb"
Once deleted, variables cannot be recovered. Proceed (y/[n])? y

Step 1

Response transformation

Let us visualize the distribution of the response price.

sns.histplot(train_clean.price)
plt.xlim([0,2000]);

As the response price is right-skewed, we will take the log-transform to reduce the skew.

train_clean['log_price'] = np.log(train_clean.price)

No mistake!

However, if you have mentioned something like the following, it is fine.

When plotting price to decide whether it needs a transformation, you set the x-axis limit to a maximum of 2000. This hides the full extent of the skew in the data, and it also hides potential extreme outliers (such as a value of 99,000), which may require further exploration if they turn out to be influential points. The solution is to plot without an axis limit. Taking the log of price is the correct transformation to address the right skew.

Step 2

Capping outliers

As outliers may distort the regression model, let us cap the outlying values of the transformed response.

#Finding upper and lower quartiles and interquartile range
q1 = np.percentile(train_clean['log_price'],25)
q3 = np.percentile(train_clean['log_price'],75)
intQ_range = q3-q1
#Tukey's fences
Lower_fence = q1 - 1.5*intQ_range
Upper_fence = q3 + 1.5*intQ_range
# Capping the outlying values
train_clean.loc[(train_clean.log_price < Lower_fence), 'log_price'] = Lower_fence
train_clean.loc[(train_clean.log_price > Upper_fence), 'log_price'] = Upper_fence

Mistake 1: Should find outliers with respect to the model (1 point)

Capping outliers is the wrong approach because outliers may have little impact on the model if they are not influential points, and the model may be able to explain them easily. For example, we may find that these listings have extremely large values of accommodates, which may be why their prices are so high. Capping them prevents the model from learning a relationship it could easily have captured, so we will likely underestimate similar data points in the test dataset.
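The difference between an outlying response and an influential point can be sketched with Cook's distance. The example below is only an illustration under stated assumptions: the data are simulated, the helper cooks_distance() is hypothetical (it is not part of the original solution), and a single predictor stands in for accommodates. Both added listings fall above the upper Tukey fence, but only the one that the predictor cannot explain has a large influence on the fit.

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's distance of each observation for an OLS fit of y on X (intercept added)."""
    X1 = np.column_stack([np.ones(len(X)), X])
    n, p = X1.shape
    XtX_inv = np.linalg.pinv(X1.T @ X1)
    # Diagonal of the hat matrix (leverage of each observation)
    leverage = np.einsum('ij,jk,ik->i', X1, XtX_inv, X1)
    beta = XtX_inv @ X1.T @ y
    resid = y - X1 @ beta
    mse = resid @ resid / (n - p)
    return resid**2 / (p * mse) * leverage / (1 - leverage)**2

# Simulated data (an assumption for illustration, not the course dataset)
rng = np.random.default_rng(0)
accommodates = rng.integers(1, 9, size=200).astype(float)
log_price = 4.0 + 0.2 * accommodates + rng.normal(0, 0.3, size=200)

# Second-to-last point: large log_price explained by a large accommodates value.
# Last point: large log_price that the predictor does not explain.
accommodates = np.append(accommodates, [16.0, 2.0])
log_price = np.append(log_price, [7.2, 9.0])

# Tukey's upper fence on the response flags both added points
q1, q3 = np.percentile(log_price, [25, 75])
upper_fence = q3 + 1.5 * (q3 - q1)

# Cook's distance measures influence with respect to the model
cooks = cooks_distance(accommodates.reshape(-1, 1), log_price)
print(log_price[-2:] > upper_fence, cooks[-2:])
```

Points with a large Cook's distance are the ones worth investigating further; a point that is merely beyond the fences but well explained by the predictors can be left as it is.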

Step 3

Combining levels of categorical predictors with very few observations

Some levels of categorical variables may have very few observations, which may lead to unreliable estimates of their regression coefficients. Thus, we will merge such levels into an ‘others’ category.

There are 76 levels of neighbourhood_cleansed some of which have very few observations.

train_clean.neighbourhood_cleansed.value_counts().shape
(76,)
train_clean.neighbourhood_cleansed.value_counts().tail()
South Deering    2
West Elsdon      2
Riverdale        1
Gage Park        1
Edison Park      1
Name: neighbourhood_cleansed, dtype: int64
test_clean.neighbourhood_cleansed.value_counts().shape
(76,)
test_clean.neighbourhood_cleansed.value_counts().tail()
Avalon Park        1
South Deering      1
Mount Greenwood    1
Edison Park        1
Chicago Lawn       1
Name: neighbourhood_cleansed, dtype: int64

Let us merge levels of neighbourhood_cleansed that have fewer than 40 observations. Assume that 40 is a reasonable cut-off. There is no mistake in the choice of this cut-off.

# Merging levels of neighbourhood_cleansed that have less than 40 observations in train data
train_clean['neighbourhood_cleansed'] = train_clean[['id','neighbourhood_cleansed']].groupby(['neighbourhood_cleansed'], 
    group_keys=False).transform(lambda x:'others' if x.count() < 40 else train_clean.loc[x.index,'neighbourhood_cleansed'])
# Merging levels of neighbourhood_cleansed that have less than 40 observations in test data
test_clean['neighbourhood_cleansed'] = test_clean[['id','neighbourhood_cleansed']].groupby(['neighbourhood_cleansed'], 
    group_keys=False).transform(lambda x:'others' if x.count() < 40 else test_clean.loc[x.index,'neighbourhood_cleansed'])

Similarly, we can merge levels of all such categorical variables, using appropriate cut-offs.

Mistake 2: Should keep neighbourhoods in test that are in train, instead of using the cut-off used in train (0.5 points)

Some neighbourhoods that have 40 or more observations in the train data may have fewer than 40 observations in the test data. Such neighbourhoods will be renamed as ‘others’ in the test data, but not in the train data, which will lead to different dummy variables for neighbourhoods in the train and test data. To rectify that, you may create columns for those neighbourhoods in the test data and set all their values to 0. However, that will be inaccurate because those neighbourhoods actually have listings in the test data; they were just renamed as ‘others’. So, the correction is to simply rename as ‘others’ in the test data exactly those neighbourhoods that are renamed as ‘others’ in the train data.
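One way to apply this correction is to decide which levels to keep from the train data alone, and then apply that same list to both datasets. The snippet below is a minimal sketch on toy data; the data frames, the variable names min_count and kept_levels, and the level names are assumptions made for illustration.

```python
import pandas as pd

# Toy data (hypothetical): level 'B' is frequent in train but not in test
train = pd.DataFrame({'neighbourhood_cleansed': ['A'] * 50 + ['B'] * 45 + ['C'] * 3})
test = pd.DataFrame({'neighbourhood_cleansed': ['A'] * 45 + ['B'] * 30 + ['C'] * 2 + ['D']})

# Levels to keep are determined from the train data only
min_count = 40
counts = train['neighbourhood_cleansed'].value_counts()
kept_levels = counts[counts >= min_count].index

# Apply the same mapping to both datasets: anything not kept becomes 'others'
for df in (train, test):
    df['neighbourhood_cleansed'] = df['neighbourhood_cleansed'].where(
        df['neighbourhood_cleansed'].isin(kept_levels), 'others')

print(sorted(train['neighbourhood_cleansed'].unique()))
print(sorted(test['neighbourhood_cleansed'].unique()))
```

With this approach, 'B' stays a separate level in the test data even though it has fewer than 40 test observations, and a level that never appears in the train data (here 'D') also falls into 'others'.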

Step 4

Dummy variables

Let us convert categorical variables to dummy variables, as we intend to develop a ridge regression model. We will use the argument drop_first = True in the Pandas function get_dummies() as it reduces the size of the dataset without losing any information from the data.

# Train data
train_clean = pd.get_dummies(train_clean, drop_first = True)

# Cleaning column names
train_clean.columns = train_clean.columns.str.replace(' ', '_')
train_clean.columns = train_clean.columns.str.replace('-', '_')
train_clean.columns = train_clean.columns.str.replace('/', '_')

# Test data
test_clean = pd.get_dummies(test_clean, drop_first = True)

# Cleaning column names
test_clean.columns = test_clean.columns.str.replace(' ', '_')
test_clean.columns = test_clean.columns.str.replace('-', '_')
test_clean.columns = test_clean.columns.str.replace('/', '_')

Let us check if we have the same number of columns in the train and test data.

train_clean.shape
(5000, 96)
test_clean.shape
(3338, 91)
# Columns in train data that are not in test data
np.setdiff1d(train_clean.columns, test_clean.columns)
array(['log_price', 'neighbourhood_cleansed_Avondale',
       'neighbourhood_cleansed_Douglas',
       'neighbourhood_cleansed_Lincoln_Square', 'price'], dtype=object)

There are listings in 3 neighbourhoods in the train data that must also be in test data. Let us create the columns for those neighbourhoods in test data so that we have the same columns in both train and test datasets.

test_clean['neighbourhood_cleansed_Douglas'] = 0
test_clean['neighbourhood_cleansed_Lincoln_Square'] = 0
test_clean['neighbourhood_cleansed_Avondale'] = 0

No mistake!

However, if you have explained the mistake of the previous step in this step, it is fine.

Step 5

Ordinal variables

Here is an idea to further reduce the size of the dataset without losing any information. We will use the dummy variables to create an ordinal variable, which will have the information of all the dummy variables that correspond to the same categorical variable.

Let us replace the dummy variables of room_type with an ordinal variable.

####----Train data processing---------####

# making one big room_type column including all of the room types 
train_clean['room_type'] = (train_clean['room_type_Hotel_room'] * 1 +
                   train_clean['room_type_Private_room'] * 2 + train_clean['room_type_Shared_room'] * 3 + 
                           (1-(train_clean['room_type_Hotel_room']+ train_clean['room_type_Private_room'] + \
                               train_clean['room_type_Shared_room']))*4  )

# Drop the dummy variables
train_clean.drop(columns = ['room_type_Hotel_room', 'room_type_Private_room', 'room_type_Shared_room'],inplace=True)


####----Test data processing---------####

# making one big room_type column including all of the room types 
test_clean['room_type'] = (test_clean['room_type_Hotel_room'] * 1 +
                   test_clean['room_type_Private_room'] * 2 + test_clean['room_type_Shared_room'] * 3 + 
                           (1-(test_clean['room_type_Hotel_room']+ test_clean['room_type_Private_room'] + \
                               test_clean['room_type_Shared_room']))*4  )

# Drop the dummy variables
test_clean.drop(columns = ['room_type_Hotel_room', 'room_type_Private_room', 'room_type_Shared_room'],inplace=True)

Similarly, other dummy variables can be converted to ordinal variables to reduce data size without losing any information.

Mistake 3: Creating unreasonable constraint (1 point)

Ordinal variables are not appropriate in this scenario because the encoding assumes that the levels of the categorical variable have an inherent order, which is not true for room_type. This step should be skipped (or applied only to variables known to be ordered). Even if there were an order, the encoding adds another constraint: the difference between the expected response for any two consecutive levels is the same.
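The constraint can be seen in a small least-squares comparison. This is a sketch with made-up group means (the numbers are assumptions, not estimates from the data): dummy coding can reproduce any set of group means, while a single ordinal column forces equally spaced fitted values.

```python
import numpy as np

# Hypothetical mean log prices for four room types coded 1..4 (noise-free for clarity)
codes = np.repeat([1, 2, 3, 4], 25)
group_means = np.array([5.0, 4.2, 3.5, 5.1])
y = group_means[codes - 1]

# Ordinal coding: intercept plus one numeric column
X_ord = np.column_stack([np.ones(len(codes)), codes])
beta_ord, *_ = np.linalg.lstsq(X_ord, y, rcond=None)
fitted_ord = X_ord @ beta_ord

# Dummy coding: intercept plus one indicator per non-baseline level
X_dum = np.column_stack([np.ones(len(codes))] +
                        [(codes == k).astype(float) for k in (2, 3, 4)])
beta_dum, *_ = np.linalg.lstsq(X_dum, y, rcond=None)
fitted_dum = X_dum @ beta_dum

# Fitted value for each level under the ordinal coding: consecutive gaps are all equal
pred_ord = beta_ord[0] + beta_ord[1] * np.arange(1, 5)
print(np.diff(pred_ord))
print(np.abs(fitted_dum - y).max(), np.abs(fitted_ord - y).max())
```

The dummy-coded fit recovers the group means exactly, whereas the ordinal fit cannot, because its four fitted values must lie on a straight line in the level code.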

Step 6

Scaling data

As we plan to develop a ridge regression model, we will scale predictors.

X_train = train_clean.drop(columns = ['price', 'log_price', 'id', 'host_id'])
X_test = test_clean.drop(columns = ['id', 'host_id'])
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.fit_transform(X_test)

Let us check the shapes of train and test data to see if they are consistent.

X_train.shape
(5000, 90)
X_test.shape
(3338, 90)

Train and test datasets are consistent with regard to columns!

Mistake 4: Must use transform() on test data (1 point)

fit_transform() should not be used on the test data, since the test data must be scaled using the mean and variance of the columns of the train data, not of the test data.
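A minimal sketch of the corrected scaling is shown below, using simulated arrays in place of the actual predictors (the data are an assumption for illustration). The scaler is fitted on the train data only, and the fitted scaler is then applied to the test data with transform().

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Simulated predictors (hypothetical); the test data has a shifted mean
rng = np.random.default_rng(0)
X_train = rng.normal(loc=10.0, scale=2.0, size=(100, 3))
X_test = rng.normal(loc=12.0, scale=2.0, size=(40, 3))

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean and variance from train
X_test_scaled = scaler.transform(X_test)        # reuse the train mean and variance

# The test data is scaled with train statistics, so its column means need not be 0
print(X_test_scaled.mean(axis=0))
```

If fit_transform() were used on the test data instead, the shift between the train and test distributions would be erased, and the model would see test values on a different scale than the one it was trained on.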

Step 7

Two-factor interactions

Let us include two-factor interactions of all predictors.

poly = PolynomialFeatures(2, include_bias = False)
X_train_poly = poly.fit_transform(X_train_scaled)
X_test_poly = poly.transform(X_test_scaled)

Mistake 5: Must ensure that predictors are in the same order in train and test (0.5 points)

Note that three predictors (or columns) were added to the test data in Step 4. These columns are appended at the right-hand end of the test dataset. This implies that the order in which the columns appear in the train and test data is different. This, in turn, implies that the order in which the columns appear in the scaled train and test datasets is also different. However, we lose the column names in the scaled datasets in Step 6. So, the function transform() used here does not throw an error stating that the columns must be in the same order, and it creates the interactions. However, the interactions created in the train and test datasets are in a different order, which will lead to incorrect predictions on the test data.
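A simple safeguard is to reorder the test columns to match the train columns before any step that discards column names. The following is a sketch on toy data frames; the values are made up, and only the column names follow the ones used above.

```python
import pandas as pd

# Toy predictors (hypothetical values); same columns, different order
X_train = pd.DataFrame({'accommodates': [2, 4],
                        'neighbourhood_cleansed_Avondale': [1, 0],
                        'room_type': [4, 2]})
X_test = pd.DataFrame({'accommodates': [3],
                       'room_type': [2],
                       'neighbourhood_cleansed_Avondale': [0]})

# Select the test columns in the train order; this raises a KeyError
# if any train column is missing from the test data
X_test = X_test[X_train.columns]

print(list(X_test.columns) == list(X_train.columns))
```

Indexing with X_train.columns doubles as a consistency check, since a missing column fails loudly instead of producing silently misaligned arrays.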

Step 8

Model hyperparameter optimization

Let us find the optimal value of the regularization parameter for a ridge regression model.

alphas = np.logspace(2,0.5,2)
modelcv = RidgeCV(alphas = alphas, scoring = 'neg_root_mean_squared_error').fit(X_train_poly, train_clean.log_price)
modelcv.alpha_
100.0

Mistake 6: Should expand search space (1 point)

If the optimal hyperparameter value is found at the edge of the search space, the search space must be expanded in that direction. The cost function is highly likely to decrease further if we continue the search in the direction in which it is decreasing.
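One possible way to automate this check is to widen the grid whenever the selected value lands on its boundary. The helper below, fit_ridge_with_expanding_grid(), is a hypothetical function written for this illustration, and the data are simulated; the grid limits and step are arbitrary assumptions.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

def fit_ridge_with_expanding_grid(X, y, low=0.0, high=2.0, step=2.0, max_expansions=5):
    """Fit RidgeCV, widening the log10(alpha) range while the optimum sits on an edge."""
    for _ in range(max_expansions + 1):
        alphas = np.logspace(low, high, 30)
        model = RidgeCV(alphas=alphas, scoring='neg_root_mean_squared_error').fit(X, y)
        if np.isclose(model.alpha_, alphas[0]):
            low -= step    # optimum at the lower edge: search smaller values
        elif np.isclose(model.alpha_, alphas[-1]):
            high += step   # optimum at the upper edge: search larger values
        else:
            break          # optimum is strictly inside the grid
    return model, alphas

# Simulated data (an assumption for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = X[:, 0] + rng.normal(size=200)

model, alphas = fit_ridge_with_expanding_grid(X, y)
print(model.alpha_, alphas.min(), alphas.max())
```

A grid with only two values, as in the step above, cannot reveal where the minimum lies; a finer grid over a wider range makes it visible whether the selected value is an interior optimum.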

Step 9

Cross-validation

Let us find the 5-fold cross validated root mean squared error (RMSE) to check if the model with the optimal regularization parameter is good, before making predictions.

np.exp(np.mean(-cross_val_score(Ridge(alpha = modelcv.alpha_), X_train_poly, 
                train_clean.log_price, scoring = 'neg_root_mean_squared_error', n_jobs = -1)))
30.617446257505094

The 5-fold cross-validated RMSE is only around $30. The model seems to be good!

Mistake 7: Incorrect back-transformation to units of response (1 point)

cross_val_score() returns 5 errors in the units of log price. Taking the exponential of the average of these errors does not convert the error into the units of the response. Instead, the function cross_val_predict() should be used to get the cross-validated predictions in the units of log price. Those predictions should then be exponentiated to get them in the units of price, and the cross-validated error must be computed by comparing them to the actual untransformed price.
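The back-transformation can be sketched as follows. The data are simulated and the value of alpha is arbitrary (both are assumptions for illustration); the point is the order of operations: predict on the log scale, exponentiate the predictions, then compute the error against the untransformed price.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_predict

# Simulated predictors and response (hypothetical)
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
log_price = 4.5 + X @ np.array([0.5, 0.3, 0.0, 0.0, 0.1]) + rng.normal(0, 0.4, size=300)
price = np.exp(log_price)

# Cross-validated predictions in the units of log price
pred_log = cross_val_predict(Ridge(alpha=1.0), X, log_price, cv=5)

# Back-transform the predictions, then compute RMSE in the units of price
pred_price = np.exp(pred_log)
rmse_price = np.sqrt(np.mean((price - pred_price) ** 2))

# For contrast: exponentiating an RMSE computed on the log scale
rmse_log = np.sqrt(np.mean((log_price - pred_log) ** 2))
print(rmse_price, np.exp(rmse_log))
```

The two printed numbers differ substantially, because the exponential of a log-scale error is a multiplicative factor rather than an error in the units of price.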

Step 10

Model predictions

Let us use the model corresponding to the optimal regularization parameter value to make predictions.

test_predictions = np.exp(modelcv.predict(X_test_poly))

No mistake!

Order of steps

Mistake 8: (1 point)

Step 7 must come before Step 6: predictors must be scaled after including the two-factor interactions.
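One way to enforce this order is to chain the steps in a scikit-learn pipeline, so that the interaction terms are created first and then scaled with statistics learned from the train data. The sketch below uses simulated data and an arbitrary alpha, both assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Simulated data (hypothetical)
rng = np.random.default_rng(0)
X_train = rng.normal(5, 2, size=(100, 3))
y_train = X_train[:, 0] * X_train[:, 1] + rng.normal(size=100)
X_test = rng.normal(5, 2, size=(20, 3))

# Interactions first, scaling second, ridge regression last
model = make_pipeline(PolynomialFeatures(2, include_bias=False),
                      StandardScaler(),
                      Ridge(alpha=1.0))
model.fit(X_train, y_train)
test_predictions = model.predict(X_test)

# All columns that enter the ridge model, including interactions, are standardized
Z_train = model[:-1].transform(X_train)
print(Z_train.mean(axis=0).round(6), Z_train.std(axis=0).round(6))
```

Because the pipeline calls fit_transform() on the train data and only transform() on the test data, it also avoids the scaling mistake described in Step 6.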